Let’s first take a high-level look at the data to see what we’re dealing with:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Looks like we have almost 5 thousand observations (4898) with 11 attributes each. X seems to be a duplicate of the index, so we can probably delete it altogether; quality is the output variable, produced based on those 11 attributes by professional wine judges.
Let’s delete the X variable and run the ncol() function to make sure it worked:
## [1] 12
It did work, so we’re ready to move on.
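A minimal sketch of that step in base R follows. The three-row data frame is only a stand-in built from values in the str() output above; the real analysis would operate on the full dataset, and the exact call the original used is not shown, so treat this as one possible way to do it.

```r
# Toy stand-in for the wine data (values copied from the str() output above).
wine <- data.frame(
  X             = 1:3,
  fixed.acidity = c(7.0, 6.3, 8.1),
  quality       = c(6L, 6L, 6L)
)

cols_before <- ncol(wine)

# Assigning NULL to a column removes it from the data frame.
wine$X <- NULL

cols_after <- ncol(wine)
cols_before - cols_after   # exactly one column was removed
names(wine)                # X is no longer listed
```

On the full dataset the same check takes the column count from 13 to 12, which matches the output above.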
Let’s build a histogram of wine quality to see what wine made its way into this dataset:
Most ratings seem average (peaking at 6), with relatively few excellent and poor wines. It’d be interesting to see how many samples actually got grades of 4 and under or 8 and over.
Poor wines:
## [1] 183
Excellent wines:
## [1] 180
Only slightly more than 7% (363 out of 4898) of all wine samples were given “extreme” grades! Further on throughout the analysis, we’ll label wine samples with grades of 4 and under as poor and with grades of 8 and over as excellent.
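The labelling rule can be sketched with cut(). The grade vector below is hypothetical, and the label "average" for the middle group is my own naming rather than something taken from the analysis; only the two counts in the last line come from the output above.

```r
# Hypothetical grades, not the real data.
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)

# Grades of 4 and under are "poor", grades of 8 and over are "excellent".
# cut() uses right-closed intervals by default: (-Inf, 4], (4, 7], (7, Inf).
rating <- cut(quality,
              breaks = c(-Inf, 4, 7, Inf),
              labels = c("poor", "average", "excellent"))

table(rating)

# Share of "extreme" grades in the toy vector.
extreme_share_toy <- mean(rating != "average")

# The same share computed from the counts reported above (183 poor, 180 excellent).
extreme_share_reported <- (183 + 180) / 4898
round(100 * extreme_share_reported, 2)   # a little over 7 percent
```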
Let’s now explore how much alcohol our wine samples contain:
The most common alcohol percentage is around 9.4, with what looks like another peak in the 12-12.4 area, but it might be more insightful to subset our data to see how alcohol percentages are distributed among poor and excellent wines.
At first glance, excellent wines tend to contain more alcohol than poor ones. However, to be more certain, we might want to compute the correlation between quality and alcohol content, which is exactly what we’ll be doing in the Bivariate Plots section of the analysis.
Time to move on to how much residual sugar and salt (that is, chlorides) our wine samples contain. Let’s start with residual sugar:
Apparently, it makes sense to set some limits on the X axis and pick a more granular binwidth.
There’s a spike between 1 and 2 - let’s go even more microscopic and take a closer look at it.
Seems like values are distributed more or less normally in this range, with peaks at 1.2 and 1.4 and a little right tail.
On the whole, we can see a pronounced positive skew in the distribution of residual sugar, so we could try log-transforming this data to get rid of the right tail.
Looks better! The distribution has become bimodal, with the peaks at about 1.3 and 8.
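To illustrate why a log10 transform tames a long right tail, here is a small sketch on simulated data. The rlnorm() draw is only a stand-in for a right-skewed variable (it will not reproduce the bimodal shape seen in the real residual sugar data), and the skewness() helper is defined here for illustration because base R does not ship one.

```r
set.seed(1)

# Simulated right-skewed values; a stand-in, not the real residual sugar column.
sugar <- rlnorm(1000, meanlog = 1.5, sdlog = 0.8)

# Sample skewness: third central moment divided by the variance to the power 1.5.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(sugar)          # clearly positive: long right tail
skew_log <- skewness(log10(sugar))   # close to zero after the transform

c(raw = skew_raw, log10 = skew_log)
```

A histogram of log10(sugar) would show the same thing visually: the tail is pulled in and the bulk of the data spreads out.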
The description that comes with this dataset says that wines with less than 1 gram of residual sugar per liter are quite rare, but the initial histogram we built clearly indicates we have some. Let’s see what those wine samples are:
## [1] 77
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 55 6.8 0.20 0.59 0.9 0.147
## 173 7.6 0.48 0.37 0.8 0.037
## 210 6.1 0.40 0.31 0.9 0.048
## 224 6.5 0.19 0.30 0.8 0.043
## 260 5.8 0.36 0.38 0.9 0.037
## 302 8.3 0.20 0.35 0.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 55 38 132 0.9930 3.05 0.38
## 173 4 100 0.9902 3.03 0.39
## 210 23 170 0.9930 3.22 0.77
## 224 33 144 0.9936 3.42 0.39
## 260 3 75 0.9904 3.28 0.34
## 302 12 74 0.9920 3.13 0.38
## alcohol quality
## 55 9.1 6
## 173 11.4 4
## 210 9.5 6
## 224 9.1 6
## 260 11.4 4
## 302 10.5 6
Since we’re mostly interested in the quality, it might be a good idea to find out what grades those wines were given.
Most of them showed average results (5 and 6), with a few notable exceptions (3-point and 8-point wines).
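The filter-then-tabulate step might look like the sketch below. The rows are a small excerpt copied from outputs shown earlier in this document (two rows from the str() output plus the six low-sugar rows printed above), so the counts apply to this excerpt only, not to all 77 samples.

```r
# Excerpt of rows already shown in this document; not the full dataset.
wine <- data.frame(
  residual.sugar = c(20.7, 1.6, 0.9, 0.8, 0.9, 0.8, 0.9, 0.9),
  quality        = c(6L,   6L,  6L,  4L,  6L,  6L,  4L,  6L)
)

# Keep only wines with less than 1 gram of residual sugar per liter.
low_sugar <- subset(wine, residual.sugar < 1)

nrow(low_sugar)                          # number of matching rows in the excerpt
grade_counts <- table(low_sugar$quality) # how their grades are distributed
grade_counts
```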
Wines containing more than 45 grams of residual sugar per liter are considered sweet. In fact, in our initial histogram, one wine sample with over 60 grams of sugar (wow, that must be really sweet!) jumps right out at you:
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 2782 7.8 0.965 0.6 65.8 0.074
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 2782 8 160 1.03898 3.39 0.69
## alcohol quality
## 2782 11.7 6
Despite its sweetness, it fared pretty well and scored a solid 6 from the judges!
Let’s now explore residual sugar levels in poor and excellent wines:
The modes here are 1 (for poor wines) and 2 (for excellent wines), which doesn’t look like much of a difference. Both distributions have heavy right tails, each special in its own way: in the case of poor wines, the tail is a little longer, whereas with excellent wines there’s a sudden spike at the end of the tail (at 14, to be precise).
Wine tasting notes also contain the word ‘saline’, which refers to how salty a wine is. But can a wine really be salty and how does it affect its quality? Let’s try to find this out by examining the next attribute - chlorides - which indicates the amount of salt in the wine.
Same as with residual sugar, distribution of chlorides is positively skewed, so log-transforming the data might help us solve this issue:
As we can see, there’s a pronounced peak at around 0.048, with the bulk of the data lying between 0.03 and 0.06.
Let’s also take a look at the wine samples with the extreme values:
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 3774 5 0.61 0.12 1.3 0.009
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 3774 65 100 0.9874 3.26 0.37
## alcohol quality
## 3774 13.5 5
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 485 6.2 0.37 0.3 6.6 0.346
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 485 79 200 0.9954 3.29 0.58
## alcohol quality
## 485 9.6 5
Both of those were given a 5 by the judges, which is not very impressive.
It’s curious to see what level of chlorides poor and excellent wines actually have:
Almost the entire subset of excellent wines is situated below the point of 0.05, while the poor wine samples are more spread out and the distribution has a long right tail. So, tentatively, wines with higher grades tend to have a lower level of chlorides and therefore be less salty.
Moving on to pH, which describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic).
Most of the wine samples fall in the 3-3.3 range, peaking at 3.22. It would be beneficial to compare pH distributions for excellent and poor wines side by side.
At first glance, no drastic differences that catch the eye. We’ll need to examine the correlation between the pH level and the wine quality more closely in the Bivariate Plots section to be able to draw any conclusions.
Time to explore density. According to the dataset description, wine density is usually close to that of water and largely depends on alcohol content and residual sugar in a particular wine.
We can see a couple of outliers to the right, but the data is densely packed under the value of 1 - zooming in a bit might offer some more insight:
Most of our samples seem to have density that is a bit lower than that of water - 0.992.
Same as with pH, we’re going to examine density of poor and excellent wines separately.
0.993 for poor wines vs 0.991 for excellent wines - to establish whether this difference in peak values is significant, some statistical tests might be necessary, but that’s beyond the scope of this analysis.
Next up is citric acid. In small quantities, it can add ‘freshness’ and flavor to wines. I wonder how much citric acid our wine samples contain.
The data seems normally distributed, with an unexpected spike at around 0.49. Let’s zoom in to see what’s going on there.
There are many wine samples to the right of the mode with the value of 0.49. I wonder what grades those wines got:
Surprisingly, the grades run the gamut from 4 to 9. However, most wine samples that contain 0.49 grams of citric acid per liter scored 6.
Let’s turn to examining concentration of citric acid in poor and excellent wines separately:
The bulk of excellent wines lies between 0.25 and 0.35, whereas the poor wine samples are more spread out, most of them falling in the 0.05-0.5 range. The modes of the two distributions are close: 0.25 for poor wines vs 0.3 for excellent ones.
Next up is sulphates, added to wines as an antimicrobial and antioxidant.
The distribution has a long right tail, so - again - log-transforming the data might help fix this issue:
Now it looks more normally distributed, with a pronounced peak at around 0.53. The initial histogram exposes a few outliers that have more than 1 gram of sulphates per liter - let’s look at those:
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 2442 7.2 0.20 0.28 1.6 0.028
## 4583 6.3 0.37 0.51 6.3 0.048
## 4887 6.2 0.21 0.28 5.7 0.028
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 2442 13 168 0.99203 3.17 1.06
## 4583 35 146 0.99430 3.10 1.01
## 4887 45 121 0.99168 3.21 1.08
## alcohol quality
## 2442 11.50 6
## 4583 10.50 6
## 4887 12.15 7
Those wines did pretty well, scoring 6 and 7.
As usual, we’ll now examine concentration of sulphates in poor and excellent wines:
The bulk of poor wine samples seems to be more tightly packed and peaks at about 0.45, whereas excellent wine samples look more spread out and peak at about 0.4.
In this subsection, we’ll be looking at the concentration of tartaric acid (the fixed.acidity variable) and acetic acid (the volatile.acidity variable). The latter, at too high levels, can make a wine taste like vinegar.
We might benefit from a more granular histogram here:
Now it’s easier to see the peak value, which is around 6.5.
In the initial histogram, some outliers are immediately obvious. We’ll examine a few of them more closely:
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 208 10.2 0.44 0.88 6.2 0.049
## 874 10.3 0.17 0.47 1.4 0.037
## 1240 10.3 0.25 0.48 2.2 0.042
## 1373 10.7 0.22 0.56 8.2 0.044
## 1374 10.7 0.22 0.56 8.2 0.044
## 1527 14.2 0.27 0.49 1.1 0.037
## 2051 11.8 0.23 0.38 11.1 0.034
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 208 20 124 0.9968 2.99 0.51
## 874 5 33 0.9939 2.89 0.28
## 1240 28 164 0.9980 3.19 0.59
## 1373 37 181 0.9980 2.87 0.68
## 1374 37 181 0.9980 2.87 0.68
## 1527 33 156 0.9920 3.15 0.54
## 2051 15 123 0.9997 2.93 0.55
## alcohol quality
## 208 9.9 4
## 874 9.6 3
## 1240 9.7 5
## 1373 9.5 6
## 1374 9.5 6
## 1527 11.1 6
## 2051 9.7 3
Almost half of those are poor wine samples and the rest are of medium quality (5 and 6). To check if wine quality drops as tartaric acid concentration increases, we might want to compare this concentration in poor and excellent wines, so that we can draw some tentative conclusion:
Both the peak values seem equal to 7, although the distribution of poor wine samples is more right-skewed, whereas that of excellent wine samples is left-skewed.
We’ll analyze acetic acid the same way as we did tartaric acid.
This distribution has a long right tail, so we can proceed in two ways: just chop the tail off by applying the limit to the X axis, or log-transform our data. Let’s try to do both for a change and see what results we end up with.
We get almost the same peak value of about 0.28, although it’s a bit off to the left (by circa 0.01) in case of log10 transform.
Let’s now take a closer look at some of the outliers that contain over 0.9 grams of acetic acid per liter:
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 373 6.6 0.905 0.19 0.8 0.048
## 1857 10.0 0.910 0.42 1.6 0.056
## 1952 9.9 1.005 0.46 1.4 0.046
## 2155 9.8 0.930 0.45 8.6 0.052
## 2782 7.8 0.965 0.60 65.8 0.074
## 4040 6.1 1.100 0.16 4.4 0.033
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 373 17 204 0.99340 3.34 0.56
## 1857 34 181 0.99680 3.11 0.46
## 1952 34 185 0.99660 3.02 0.49
## 2155 34 187 0.99940 3.12 0.59
## 2782 8 160 1.03898 3.39 0.69
## 4040 8 109 0.99058 3.35 0.47
## alcohol quality
## 373 10.0 5
## 1857 10.0 4
## 1952 10.2 4
## 2155 10.2 4
## 2782 11.7 6
## 4040 12.4 4
We can see those are poor to medium wines, which seems to be in line with the above statement from the dataset description that claims that higher concentrations of this acid lead to a pronounced taste of vinegar in a wine sample. Of course, to draw any conclusions, a more in-depth analysis is needed, which we’ll undertake in the next sections. For now, we’ll try to find out what concentrations of acetic acid are typical of poor and excellent wines.
Peak values are almost equal, although the figure seems to be a bit greater for poor wines (0.28 vs 0.26). Moreover, the distribution of poor wines has a longer right tail, reaching higher values than that of excellent wines.
Here, we’ll be exploring SO2 levels in our wine samples, starting off with free SO2.
Almost all the wine samples sit under the value of 100, so we might want to zoom in a bit:
This distribution seems to peak at a value close to 30.
Let’s also examine the outlier situated far off to the right in the first histogram:
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 4746 6.1 0.26 0.25 2.9 0.047
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 4746 289 440 0.99314 3.44 0.64
## alcohol quality
## 4746 10.5 3
Turns out it’s quite a low-quality wine sample.
Time to compare how much free SO2 is contained in poor and excellent wines:
Looks like lower levels of free SO2 are more typical of poor wines, with the peak at 2 and the bulk of the data sitting between 2 and 37. For excellent wines, most wine samples fall in the 27-47 range, with the peak at 29. One more interesting thing: the poor wines distribution here is the only one so far that looks like an exponential one, which means higher levels of free SO2 are much more rarely observed in poor wines than low ones.
Since total SO2 = free SO2 + bound SO2, we can create a new variable for bound SO2 and analyze it separately, same as we did with free SO2.
Let’s see some high-level information about our new variable:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 78.0 100.0 103.1 125.0 331.0
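Deriving the new variable is a single subtraction. The sketch below uses two rows copied from the outlier listings in this document (samples 4746 and 1418); the column name bound.sulfur.dioxide matches the one that appears in those listings.

```r
# Two rows copied from the outlier listings in this document.
wine <- data.frame(
  free.sulfur.dioxide  = c(289, 35.5),
  total.sulfur.dioxide = c(440, 366.5),
  row.names            = c("4746", "1418")
)

# bound SO2 = total SO2 - free SO2
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

wine
summary(wine$bound.sulfur.dioxide)
```

The two derived values (151 and 331) agree with the bound.sulfur.dioxide column printed for those samples.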
Now we can build a few histograms to better understand how it behaves.
We can clearly see an outlier containing over 300 mg/l of bound SO2 - let’s go microscopic on it:
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1418 8.6 0.55 0.35 15.55 0.057
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1418 35.5 366.5 1.0001 3.04 0.63
## alcohol quality bound.sulfur.dioxide
## 1418 11 3 331
Same as with free SO2 above, it’s also a low-quality wine, with a score of 3.
Now we’ll apply some breaks and limits to the X axis to be able to see separate values more clearly:
The peak’s become more discernible - it’s about 82.
Moving on to comparing the levels of bound SO2 in poor and excellent wines.
The peaks for the poor wines and excellent wines distributions are about 104 and 76, respectively. The bulk of the poor wine samples falls in a wider range between 44 and 168, whereas it’s just between 56 and 112 for excellent wines. Based only on this quick visual comparison, we can tentatively say that excellent wines tend to contain less bound SO2 than poor wines do.
The last to be analyzed in this subsection is the total level of SO2, which, I assume, must strongly correlate with both the level of free SO2 and the level of bound SO2 since it’s just the sum of the two. Let’s see if the total SO2 histograms we build are very different from what we had for free and bound SO2.
This histogram also shows an outlier with a pretty high total level of SO2. I believe it’s the same wine sample that we looked at above, when dealing with either free or bound SO2, but let’s check it to be sure:
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 4746 6.1 0.26 0.25 2.9 0.047
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 4746 289 440 0.99314 3.44 0.64
## alcohol quality bound.sulfur.dioxide
## 4746 10.5 3 151
Indeed, it’s the very same low-quality wine sample that we picked out earlier in the analysis.
Same as with free and bound SO2, let’s build a more granular histogram to see more clearly how the values are distributed:
Judging by the histogram, the mode of this distribution is somewhere near 114.
What’s left for us is to explore the total levels of SO2 across poor and excellent wines:
The poor wines histogram peaks at 109 and then at 189, whereas the excellent wines histogram shows two distinct peaks situated fairly close to each other - at 99 and 119. Also, poor wine samples are more spread out across the X axis, and the poor wines distribution seems to have a left tail.
The dataset contains 4898 wine samples with 11 attributes (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) and a final grade each sample received from the professional wine judges based on those attributes. I’m not counting the X variable here: as I said above, it’s just a duplicate of the index, so I dropped it from the dataset before going on with the analysis.
The main feature is quality, because this whole analysis is driven by the question “what influences the quality of white wine?”. In the next two sections, I’m going to focus on exploring the relationships between quality and other features and their combinations.
Based on the univariate analysis I’ve performed so far, I have a reason to believe features like alcohol content, chlorides, levels of SO2, fixed and volatile acidity might be more or less reliable indicators of wine quality, but it’s hard to say anything for sure until bivariate and multivariate analyses are carried out and feature relationships are explored in various ways.
Since I had data on both free SO2 and total SO2 in wines, I created a new variable called bound sulfur dioxide (SO2) by subtracting free SO2 from total SO2. I’m yet to analyze this variable closer in the next sections of my analysis, but for now it seems like higher-quality wine samples tend to contain slightly lower levels of bound SO2.
I didn’t have to clean anything or fill any gaps as this dataset is prepared in such a way that there’s no missing data in it.
As I mentioned above, at the beginning of my analysis, I got rid of the X variable since it was just a duplicate of the index and didn’t help me in any way.
Most of the remaining features in the dataset seem more or less normally distributed, but four of the distributions are positively skewed: residual sugar, chlorides, sulphates, and volatile acidity. I log-transformed (log10) all four to deal with the long tails. In fact, the residual sugar distribution turned out to be bimodal, with peaks at 1.3 and 8.
One thing I noticed is that concentration of almost all the substances is given in grams per cubic decimeter (or grams per liter, which is the same thing, and I prefer this notation), with a notable exception of levels of SO2, which are given in milligrams per liter. Further down the road, it might be worthwhile to convert them to grams per liter to see if anything changes. Same story with density, which can later be converted from grams per cubic centimeter to grams per liter to see if that transformation brings anything new and unexpected to the analysis.
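The conversions themselves are simple rescalings, sketched below with figures taken from the summary output at the top of the analysis. One thing worth keeping in mind (a general property, not a finding from this dataset): Pearson correlation does not change when a variable is multiplied by a positive constant, so the conversion affects axis labels and coefficient magnitudes rather than correlations. The five-value vectors used for that check are copied from the str() output.

```r
# mg/L -> g/L: divide by 1000 (maxima from the summary output).
so2_mg_per_l <- c(free = 289, total = 440)
so2_g_per_l  <- so2_mg_per_l / 1000

# g/cm^3 -> g/L: multiply by 1000 (mean density from the summary output).
density_g_per_cm3 <- 0.9940
density_g_per_l   <- density_g_per_cm3 * 1000

so2_g_per_l
density_g_per_l

# First five values of each column, copied from the str() output.
total_so2 <- c(170, 132, 97, 186, 186)
alcohol   <- c(8.8, 9.5, 10.1, 9.9, 9.9)

# Correlation before and after rescaling.
r_mg <- cor(alcohol, total_so2)
r_g  <- cor(alcohol, total_so2 / 1000)
all.equal(r_mg, r_g)
```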
In this section, relationships between pairs of features will be examined. One such relationship is correlation, and the quickest way to obtain pairwise correlations for the whole dataset is to use the ggpairs() function from the GGally package.
We can see a more or less pronounced correlation (I set the threshold at an absolute value of 0.35) between the following pairs:
Positive:
Negative:
Our main variable, quality, is correlated the most with alcohol (0.436), density (-0.307), bound sulfur dioxide (-0.218), chlorides (-0.21), and volatile acidity (-0.195).
It would make sense to concentrate our efforts on studying the identified correlated pairs more carefully.
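As a complement to the ggpairs() plot, the thresholding can also be done numerically. The sketch below uses base R's cor() on simulated data: the toy data frame, its coefficients, and the variable relationships are all made up for illustration (density is deliberately built to depend on alcohol), so the resulting numbers say nothing about the real wines.

```r
set.seed(42)
n <- 200

# Simulated columns; names mirror the dataset but the values are synthetic.
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
toy <- data.frame(
  alcohol = alcohol,
  density = 1.02 - 0.0025 * alcohol + rnorm(n, sd = 0.001),  # built to correlate negatively
  pH      = rnorm(n, mean = 3.2, sd = 0.15)                  # independent noise
)

r <- cor(toy)   # Pearson correlation matrix

# Keep each pair once (upper triangle) and only if |r| reaches the threshold.
idx <- which(abs(r) >= 0.35 & upper.tri(r), arr.ind = TRUE)

pairs_found <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = round(r[idx], 3)
)
pairs_found
```

Splitting pairs_found by the sign of r would give the positive and negative lists directly.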
In this subsection, we’ll look at how quality varies with the rest of the features and try to find out whether any single feature reliably tells a good wine from a bad one.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 6.575 7.300 7.600 8.525 11.800
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.800 6.400 6.900 7.129 7.600 10.200
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.500 6.400 6.800 6.934 7.400 10.300
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.838 7.300 14.200
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 6.200 6.700 6.735 7.200 9.200
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.900 6.200 6.800 6.657 7.300 8.200
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.60 6.90 7.10 7.42 7.40 9.10
The best wines in the dataset (grade 9) have the highest minimum fixed acidity (6.6) and, together with the worst wines (grade 3), the two highest medians (7.1 and 7.3, respectively). At the same time, wine samples rated 8 or 9 have the two lowest maximum levels of fixed acidity - 8.2 and 9.1.
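Per-grade summaries of this shape are what base R's by() prints, so a plausible way to produce them is sketched below; that the original used by() is an assumption on my part. The data frame is a hypothetical eight-row toy, not the real data.

```r
# Hypothetical toy data; values are made up for illustration.
wine <- data.frame(
  fixed.acidity = c(7.3, 8.5, 6.9, 7.6, 6.8, 7.4, 7.1, 6.6),
  quality       = c(3L,  3L,  4L,  4L,  5L,  5L,  9L,  9L)
)

# One summary() per grade, printed with separator lines between groups.
by(wine$fixed.acidity, wine$quality, summary)

# The same medians as a plain named vector, convenient for comparisons.
med <- tapply(wine$fixed.acidity, wine$quality, median)
med
```

The tapply() form is handy when only one statistic is needed, for example to find which grade has the highest median.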
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1700 0.2375 0.2600 0.3332 0.4125 0.6400
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1100 0.2700 0.3200 0.3812 0.4600 1.1000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.240 0.280 0.302 0.340 0.905
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2000 0.2500 0.2606 0.3000 0.9650
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.1900 0.2500 0.2628 0.3200 0.7600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.2000 0.2600 0.2774 0.3300 0.6600
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.240 0.260 0.270 0.298 0.360 0.360
This box plot shows something of a wave-like pattern in terms of median volatile acidity: it starts growing, reaches its peak, then hits the bottom at 6 and 7, and grows again towards the best wine samples. One thing to note here is that the best wine samples have the highest minimum (0.24) and the lowest maximum (0.36) values of volatile acidity.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2100 0.2575 0.3450 0.3360 0.3850 0.4700
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1900 0.2900 0.3042 0.4000 0.8800
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2400 0.3200 0.3377 0.4100 1.0000
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.270 0.320 0.338 0.380 1.660
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.2800 0.3100 0.3256 0.3600 0.7400
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.2800 0.3200 0.3265 0.3600 0.7400
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.290 0.340 0.360 0.386 0.450 0.490
Median level of citric acid doesn’t seem to vary too much across different grades, except for two spikes at 3 and 9. Best and worst wine samples have the highest minimum (0.29 and 0.21, respectively) and lowest maximum values (0.49 and 0.47, respectively) of citric acid concentration.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.588 4.600 6.392 10.700 16.200
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.300 2.500 4.628 7.100 17.550
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 7.000 7.335 11.500 23.500
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.700 3.650 5.186 7.325 19.250
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 2.100 4.300 5.671 8.200 14.800
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.60 2.00 2.20 4.12 4.20 10.60
We can see pronounced fluctuations of the median level of residual sugar across the wine grades, the highest being 7 (grade 5) and the lowest 2.2 (grade 9). However, the sweetest wine sample in the dataset (65.8) has a grade of 6. Once again, the best wine samples have the highest minimum (1.6) and lowest maximum (10.6) levels of residual sugar.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0130 0.0380 0.0460 0.0501 0.0540 0.2900
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0180 0.0210 0.0310 0.0274 0.0320 0.0350
Wine samples graded 5 have the highest median level of chlorides (0.047), and after that the median goes downward and hits the bottom at grade 9 (median 0.031, mean 0.0274). The best wine samples also have the lowest maximum level of chlorides, which is 0.035 - it’s about 3.5 times lower than the runner-up (0.121) and almost 10 times lower than the greatest value in the dataset (0.346).
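The ratios between the per-grade maxima can be checked directly. The vector below is copied from the Max. column of the chlorides summaries above; the variable names are mine.

```r
# Maximum chlorides per grade, copied from the summaries above.
max_chlorides <- c(`3` = 0.244, `4` = 0.290, `5` = 0.346, `6` = 0.255,
                   `7` = 0.135, `8` = 0.121, `9` = 0.035)

best      <- max_chlorides[["9"]]
runner_up <- min(max_chlorides[names(max_chlorides) != "9"])

ratio_runner_up <- runner_up / best            # about 3.46
ratio_overall   <- max(max_chlorides) / best   # about 9.89

round(c(runner_up = ratio_runner_up, overall = ratio_overall), 2)
```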
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 13.25 33.50 53.32 47.50 289.00
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 18.00 23.36 30.50 138.50
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 22.00 35.00 36.43 50.00 131.00
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 24.00 34.00 35.65 46.00 112.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 25.00 33.00 34.13 41.00 108.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 28.00 35.00 36.72 44.50 105.00
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.0 27.0 28.0 33.4 31.0 57.0
Wine samples with the grade of 4 seem to have the lowest median level of free SO2 among all and the second greatest maximum level of free SO2 (138.5), topped only by wine samples graded 3 (max value - 289). Most values seem to be lying below the threshold of 50 (the maximum third quartile), above which SO2 might become evident in the nose and taste of wine and influence its quality.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.0 82.5 106.0 117.3 152.2 331.0
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.0 67.5 102.0 101.9 133.8 195.0
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 91.0 114.0 114.5 137.0 293.5
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.0 76.0 97.0 101.4 123.0 243.0
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.00 71.00 86.00 90.99 106.00 199.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42.00 71.00 84.00 89.45 104.50 159.50
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61.0 62.0 82.0 82.6 96.0 112.0
Median bound SO2 level of better wines tends to lie below 100 (true for grades 6 through 9), which seems to be in line with the negative correlation (about -0.2) we’ve discovered earlier, and it reaches the lowest value of 82 at grade 9. As was to be expected, the worst wines have the highest maximum level of bound SO2 (331), whereas the best ones have the lowest maximum level, 112 - roughly a threefold difference!
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.0 105.8 159.5 170.6 210.0 440.0
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 85.0 117.0 125.3 171.5 272.0
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 121.0 151.0 150.9 182.0 344.0
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 107.2 132.0 137.0 164.0 294.0
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.0 101.0 122.0 125.1 144.2 229.0
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.0 102.5 122.0 126.2 150.0 212.5
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 85 113 119 116 124 139
Since total SO2 = free SO2 + bound SO2, we can observe the same patterns as above, when we analyzed levels of free and bound SO2 in wines. For example, the median level of total SO2 lies below 150 for the better wines (grades 6 through 9), whereas among the lower grades only grade 4 falls below that level; in fact, it has the lowest median value of all - 117 (heavily influenced by the low level of free SO2 for this grade).
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9911 0.9925 0.9944 0.9949 0.9969 1.0000
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9958 1.0000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0020
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0000
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0010
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9896 0.9898 0.9903 0.9915 0.9906 0.9970
The median density tends to decrease as quality grows. The only group that breaks this trend is wine samples of grade 5, which have the highest median density of all - 0.9953. This finding is in line with what we’ve discovered previously: to refresh our memory, quality vs density is the strongest negative correlation for our main variable (-0.307).
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.870 3.035 3.215 3.188 3.325 3.550
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.830 3.070 3.160 3.183 3.280 3.720
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.790 3.080 3.160 3.169 3.240 3.790
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.080 3.180 3.189 3.280 3.810
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.840 3.100 3.200 3.214 3.320 3.820
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.940 3.120 3.230 3.219 3.330 3.590
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.200 3.280 3.280 3.308 3.370 3.410
Most wine samples lie in the range between 3 and 3.3 on the pH scale, and the median values fit into an even narrower range of 3.15-3.3, varying in a slightly discernible bowl-like fashion: it all starts at 3.215, gradually falls to 3.16, and then starts growing again, peaking at 3.28 (wine samples graded 9).
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2800 0.3800 0.4400 0.4745 0.5425 0.7400
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2500 0.3800 0.4700 0.4761 0.5400 0.8700
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2700 0.4200 0.4700 0.4822 0.5300 0.8800
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.4100 0.4800 0.4911 0.5500 1.0600
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4800 0.5031 0.5800 1.0800
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2500 0.3800 0.4600 0.4862 0.5850 0.9500
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.360 0.420 0.460 0.466 0.480 0.610
All the median values sit between 0.4 and 0.5, with the worst and best wine samples having among the lowest median (as well as the lowest maximum) levels of sulphates. Otherwise, there’s little change across median sulphate values.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.34 11.00 12.60
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
This box plot reinforces our earlier finding that there’s a positive correlation (0.436) between alcohol content and wine quality, the strongest one involving quality. It turns out this holds mainly for wine samples of higher grades, whereas for lower-quality wines the trend is actually downward: from grade 3 to grade 5, the median alcohol level falls from 10.45 to 9.5. The best wines have the highest median level of alcohol (12.5), which is well above the medians of the lower grades. In this dataset, the group median can therefore serve as a rough indicator of wine quality: every grade with a median alcohol level below 12 is grade 7 or lower. Of course, such conclusions are restricted to our dataset only - the situation might be quite different for the whole population of wines.
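The pattern in the medians can be checked with a few lines of base R. The sketch below simply re-enters the per-grade median alcohol values from the summaries above; the variable names are made up for illustration.

```r
# Median alcohol per quality grade, copied from the summaries above.
alcohol_median <- c("3" = 10.45, "4" = 10.10, "5" = 9.50, "6" = 10.50,
                    "7" = 11.40, "8" = 12.00, "9" = 12.50)

# Grade with the lowest median: the turning point of the trend.
turning_grade <- names(which.min(alcohol_median))
print(turning_grade)

# Change in median from one grade to the next (negative = downward).
step_change <- diff(alcohol_median)
print(step_change)

# Grades whose median alcohol level is below 12.
below_12 <- names(alcohol_median)[alcohol_median < 12]
print(below_12)
```

The first two steps are negative and the remaining ones positive, which matches the V-shaped pattern described in the text.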
Let’s also build a few color-coded density plots for some of the features that formed the most strongly correlated pairs.
In all these plots, we can clearly see a bimodal distribution for the best wines. I guess this effect is due to there being very few wine samples with grade 9 in the dataset, and these take on just a few values. Our analysis might have benefited from a greater number of highest-quality wines, as we could’ve checked whether this pronounced bimodality has to do with insufficient data or whether there are some other factors at play.
As for the last density plot for residual sugar, the distributions seem quite skewed - and indeed, in the first section of this analysis, we’ve found out that the residual sugar distribution has a very heavy right tail. Let’s now try rebuilding the same density plot, but with the residual sugar variable log-transformed.
Now it becomes obvious this distribution is actually bimodal across all wine grades! A pretty curious finding that I can’t explain right away for lack of domain knowledge. It might even be a phenomenon peculiar to Portuguese wines - it’s really hard to tell without having more data handy.
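A log-transformed density plot of the kind described above can be sketched in base R as follows. This is not the original plotting code: the data are made up, and the two clusters are built in on purpose, so the snippet only shows how a log scale can reveal structure that a heavy right tail hides. The name `sugar_demo` is hypothetical.

```r
set.seed(42)  # reproducible toy data

# Made-up stand-in for residual sugar: two lognormal clusters, so the
# bimodality here is built in by construction.
sugar_demo <- c(rlnorm(300, meanlog = 0.4, sdlog = 0.3),
                rlnorm(300, meanlog = 2.3, sdlog = 0.4))

# On the raw scale the long right tail pulls the mean above the median.
raw_gap <- mean(sugar_demo) - median(sugar_demo)
print(raw_gap)

# Kernel density estimates on the raw and the log10 scale.
dens_raw <- density(sugar_demo)
dens_log <- density(log10(sugar_demo))

# Plot both curves side by side (skipped in non-interactive runs).
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  plot(dens_raw, main = "Raw scale")
  plot(dens_log, main = "log10 scale")
  par(op)
}
```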
The strongest positive correlation involving quality is quality vs alcohol (0.436). One particularly interesting thing here is that an upward trend (quality increases as alcohol content grows) holds true only for higher-quality wines, starting from the grade of 6; below this point the trend is actually downward: for wine samples graded 3-5, the lower the alcohol level, the better the wine. Every grade with a median alcohol value below 12 is grade 7 or lower, which might help us tell a good wine sample from a poor one.
The most pronounced negative correlation that has to do with our main feature is observed in the pair quality - density (-0.307). The general trend there is a downward one: with each grade, median density decreases a bit, with the notable exception of one group - wine samples of grade 5, which break this trend and actually have the greatest median density of all grades. Exactly the same picture can be seen in quality vs bound SO2 (-0.218): grade 5 wines once again break the generally downward trend.
Another interesting pattern was discovered in the pair quality vs volatile acidity (-0.195): median values there seemed to change in a wave-like fashion from one grade to another, going up and down a few times.
One more curious finding was that the residual sugar distribution, which is highly skewed initially, when log-transformed and color-coded by quality, is actually bimodal across all the wine grades, from lowest to highest. As I said above, under the relevant plot, I might be lacking some specialist knowledge to draw the right conclusion based on this fact, or it might just be a peculiarity of Portuguese wines, white ones in particular.
Fun fact: positive correlations were dominated by density (3 occurrences out of 6), whereas alcohol featured even more prominently among the negative ones (5 occurrences out of 6). Therefore, it’s only natural that these two features produced the most highly correlated pairs (which I discuss in more detail in the subsection below), and density had a part in both of them!
Among other things, total SO2 and bound SO2 turned out to be positively correlated with both density and residual sugar. As for the negative correlations, the strongest relationships were observed in the following pairs: total SO2 and free SO2 vs alcohol; pH vs fixed acidity; alcohol vs residual sugar and chlorides.
Surprisingly enough, the two most pronounced correlations didn’t involve the main variable, quality, but instead featured density, which seems to be heavily dependent on both residual sugar and alcohol content. In the former case, the correlation is positive and equals 0.839; in the latter case, the features are negatively correlated (-0.78).
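One way to rank feature pairs by correlation strength is to compute the full correlation matrix and sort its upper triangle. The sketch below uses made-up data with one positive and one negative relationship built in; it is not the code behind the figures quoted above, and the names `demo`, `cm` and `pairs_df` are hypothetical.

```r
set.seed(1)

# Toy data with a built-in positive and a built-in negative relationship.
n <- 200
sugar   <- runif(n, 1, 20)
alcohol <- runif(n, 8, 14)
demo <- data.frame(
  residual.sugar = sugar,
  alcohol        = alcohol,
  density        = 0.99 + 0.0004 * sugar - 0.001 * alcohol +
                   rnorm(n, sd = 0.0005),
  noise          = rnorm(n)
)

# Pearson correlation matrix for all columns.
cm <- cor(demo)

# Keep each pair once (upper triangle) and sort by absolute strength.
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
print(pairs_df)
```

Sorting by absolute value keeps strong negative pairs next to strong positive ones, which is how the two density pairs end up at the top.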
In the previous section, we used box plots to see how different variables are distributed across wine grades and scatter plots to discover interesting pairwise relationships between the features. This section allows us to take our analysis one step further by combining the two techniques and examining what relationships the features display (and how these relationships vary) across wine grades.
Let’s first take a look at a couple of scatter plots for the features that exhibited the strongest correlation, faceted by quality.
Looks like no surprises here. Scatter plots demonstrate the same trends across all wine quality grades: upward for density vs residual sugar and downward for density vs alcohol.
I wonder what plots would look like for less correlated features.
For the lowest-quality wines, alcohol doesn’t seem to be correlated with residual sugar at all, with a negative trend becoming more noticeable towards higher wine grades.
A somewhat similar picture here. In the case of the worst and best wines, alcohol and total SO2 are much less correlated (if correlated at all) than in wine samples of other grades, which all display a more prominent downward trend.
This time the weakest correlation between the features takes place with the best wine samples. In all other cases, an upward trend is obvious.
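The faceted scatter plots discussed above can be complemented with one correlation coefficient per grade. The following sketch uses made-up data in which one grade has no relationship and two grades have a clear negative one; the helper `make_grade()` and the data frame `demo` are hypothetical.

```r
set.seed(7)

# Toy data: one grade with no link between the features, two with a
# clear negative link.
make_grade <- function(grade, slope, n = 100) {
  alcohol <- runif(n, 8, 14)
  data.frame(
    quality              = grade,
    alcohol              = alcohol,
    total.sulfur.dioxide = 200 + slope * alcohol + rnorm(n, sd = 10)
  )
}
demo <- rbind(make_grade(3, 0), make_grade(6, -12), make_grade(7, -12))

# Correlation of the two features within each quality grade.
r_by_grade <- sapply(split(demo, demo$quality),
                     function(d) cor(d$alcohol, d$total.sulfur.dioxide))
print(round(r_by_grade, 2))
```

With very few samples in a grade, as with the extreme grades in this dataset, such per-grade coefficients are noisy and should be read with caution.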
We’ll now build a pretty straightforward linear model to see how well it can predict wine quality based on the features we’ve analyzed.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + residual.sugar, data = wine)
## m3: lm(formula = quality ~ alcohol + residual.sugar + density, data = wine)
## m4: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity,
## data = wine)
## m5: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity +
## pH, data = wine)
## m6: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity +
## pH + sulphates, data = wine)
## m7: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity +
## pH + sulphates + free.sulfur.dioxide, data = wine)
##
## ===========================================================================================================
##                                 m1          m2          m3          m4          m5          m6          m7
## -----------------------------------------------------------------------------------------------------------
## (Intercept)               2.582***    2.021***   90.313***   74.225***   97.650***  116.300***  111.166***
##                            (0.098)     (0.117)    (12.374)    (11.977)    (12.392)    (12.713)    (12.728)
## alcohol                   0.313***    0.354***    0.246***    0.286***    0.253***    0.232***    0.244***
##                            (0.009)     (0.010)     (0.018)     (0.018)     (0.018)     (0.019)     (0.019)
## residual.sugar                        0.022***    0.053***    0.052***    0.064***    0.071***    0.066***
##                                        (0.002)     (0.005)     (0.005)     (0.005)     (0.005)     (0.005)
## density                                         -87.886***  -71.546***  -96.535*** -115.304*** -110.262***
##                                                   (12.317)    (11.923)    (12.404)    (12.729)    (12.743)
## volatile.acidity                                             -2.059***   -2.024***   -1.992***   -1.940***
##                                                                (0.109)     (0.109)     (0.108)     (0.109)
## pH                                                                        0.528***    0.490***    0.462***
##                                                                            (0.076)     (0.076)     (0.076)
## sulphates                                                                             0.605***    0.571***
##                                                                                        (0.099)     (0.099)
## free.sulfur.dioxide                                                                               0.003***
##                                                                                                    (0.001)
## -----------------------------------------------------------------------------------------------------------
## R-squared                    0.190       0.202       0.210       0.264       0.271       0.277       0.280
## adj. R-squared               0.190       0.202       0.210       0.263       0.270       0.276       0.279
## sigma                        0.797       0.791       0.787       0.760       0.757       0.754       0.752
## F                         1146.395     619.354     434.085     438.646     363.847     311.778     271.828
## p                            0.000       0.000       0.000       0.000       0.000       0.000       0.000
## Log-likelihood           -5839.391   -5802.158   -5776.812   -5604.126   -5580.287   -5561.452   -5549.703
## Deviance                  3112.257    3065.298    3033.737    2827.187    2799.800    2778.350    2765.053
## AIC                      11684.782   11612.317   11563.624   11220.251   11174.574   11138.904   11117.407
## BIC                      11704.272   11638.303   11596.107   11259.231   11220.050   11190.877   11175.876
## N                             4898        4898        4898        4898        4898        4898        4898
## ===========================================================================================================
The variables in this linear model can account for 28% of the variance in the quality of white wine.
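Building models step by step, as in the table above, can be sketched with base R's `lm()` and `update()`. The example below is an illustration on made-up data, not a reproduction of the original fit; the data frame `demo` and the coefficients used to generate it are assumptions.

```r
set.seed(123)

# Toy data: quality depends on alcohol and volatile acidity plus noise.
n <- 500
demo <- data.frame(
  alcohol          = runif(n, 8, 14),
  residual.sugar   = runif(n, 1, 20),
  volatile.acidity = runif(n, 0.1, 1.0)
)
demo$quality <- 3 + 0.3 * demo$alcohol - 2 * demo$volatile.acidity +
  rnorm(n, sd = 0.8)

# Grow the model one predictor at a time.
m1 <- lm(quality ~ alcohol, data = demo)
m2 <- update(m1, . ~ . + residual.sugar)
m3 <- update(m2, . ~ . + volatile.acidity)

# Collect R-squared and AIC for each nested model.
fits <- list(m1 = m1, m2 = m2, m3 = m3)
comparison <- data.frame(
  r.squared = sapply(fits, function(m) summary(m)$r.squared),
  aic       = sapply(fits, AIC)
)
print(round(comparison, 3))
```

R-squared can never decrease when a predictor is added to a nested least-squares model, so the size of each increase (or the drop in AIC) is what shows whether a new variable is worth keeping.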
The most prominent correlations we’ve discovered were in fact so strong that, when faceted by wine quality, the features displayed the same trends across all wine grades: for density vs residual sugar, the trend was always upward, for density vs alcohol always downward.
For other, less correlated features (alcohol vs residual sugar, alcohol vs total SO2, density vs total SO2), the trend across the wine grades was also the same, with the exception of the best or worst wines, or both, where the features showed little to no correlation.
Since the correlation between density and residual sugar was somewhat stronger than that between density and alcohol (0.839 vs -0.78), I was especially interested to see how residual sugar and alcohol were correlated and expected at least a slightly positive correlation. To my surprise, the correlation turned out to be clearly negative (-0.451, the second strongest among the negative correlations discovered); in fact, it was strong enough that a downward trend manifested itself across 6 out of 7 wine grades represented in the dataset, the exception being grade 3, where the features showed no correlation at all.
Yes, I did create a linear model that makes a prediction based on 7 features from the dataset. Further increasing the number of features didn’t yield any significant improvement, so I stopped there. Surprisingly enough, the model explains a mere 28% of the variance in the target variable, quality. It seems that wine quality is not well explained by its physico-chemical properties alone. Two things to note here: first, the quality of prediction might improve with more data (right now, there are fewer than 5,000 samples); second, there are probably other factors at play, so the model might have benefited from the addition of such variables as the price of the wine, the region where it was produced, the year it was produced and other things not related to wine chemistry. Trying out other models may also lead to better results. Say, I have a hunch that tree-based methods would do well in this case.
This box plot supports our finding that the strongest positive correlation involving our main variable of interest is quality vs alcohol (0.436). An interesting thing here is that for lower wine grades, we can actually observe a downward trend that gets reversed only from grade 5 onwards. Thus, for wines of up to grade 5, the lower the alcohol content, the better a wine tends to be; after that, wine quality rises steadily with increasing alcohol content.
Moreover, the median (and mean as well) alcohol content of best wines looks significantly different from that of worst wines, which can be used to more or less reliably tell a quality wine from a poor one.
When plotted unmodified, the residual sugar distribution is highly skewed and has a long right tail. However, when log-transformed, the distribution becomes bimodal. When I later color-coded the plot, I saw the distribution was in fact bimodal across all the wine grades. Intrigued by this phenomenon, I read a few specialized articles on residual sugar in wines, but couldn’t find any explanation that would satisfy me. Therefore I’m inclined to think, for the lack of proof to the contrary, that it’s just a regional thing specific to Portuguese wines.
This faceted scatter plot illustrates the third strongest negative correlation discovered during the analysis - alcohol vs total SO2. Each subplot contains a line of best fit that visually reinforces the trend across wine grades. One interesting observation here is that for the best and worst wines, the features display little to no correlation, whereas for wines of grades 4 through 8, a clear downward trend manifests itself. It might be an indication that this particular combination of features is a bad candidate for predicting wine quality. Indeed, when I was building a linear model, alcohol turned out to be the best contributor to the overall quality of prediction, whereas total SO2 added absolutely nothing to improve it and therefore was not included in the resulting model.
The dataset I’ve analyzed contains information on almost 5,000 white wines across 11 variables plus the output variable based on sensory data, that is a grade on a scale of 0 to 10 given to each wine sample by professional wine judges. This dataset is restricted to Portuguese wines and contains only their physico-chemical properties.
I began my analysis by building histograms of each feature to understand their distribution. Most turned out to be roughly normally distributed, with a few notable exceptions (take residual sugar as an example), where I observed heavy skew and long tails. Log-transforming these variables helped me deal with the skew. I also defined thresholds for poor (grade 4 and under) and excellent (grade 8 and over) wines, then subset my dataset using these thresholds and plotted distributions of individual features across poor and excellent wines side by side. This helped me see whether these distributions were very different and identify a few potential candidates that could be useful in telling a low-quality wine from a better one.
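The poor/excellent split described above can be expressed with `subset()`. The snippet is a minimal sketch on a made-up data frame; only the two thresholds come from the text, and `wine_demo` is a hypothetical name.

```r
# Toy stand-in for the wine data (values are made up).
wine_demo <- data.frame(
  quality = c(3, 4, 5, 6, 7, 8, 9, 4, 8),
  alcohol = c(10.4, 10.1, 9.5, 10.5, 11.4, 12.0, 12.5, 9.8, 12.6)
)

# Thresholds as described in the text: grade 4 and under, grade 8 and over.
poor      <- subset(wine_demo, quality <= 4)
excellent <- subset(wine_demo, quality >= 8)

# Compare one feature across the two groups.
print(summary(poor$alcohol))
print(summary(excellent$alcohol))
```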
I went on to explore pairwise relationships between the features and pick out the most strongly correlated (both positively and negatively) pairs to focus my analysis on them. To my surprise, the main variable of interest - quality - wasn’t involved in any of the strongest correlations identified. I built a few scatter plots and included a line of best fit for each of them to more clearly see the general trend in the data points. Then I added a few box plots that reinforced my earlier findings and offered some new insights.
My greatest success was finding out that alcohol content was the most influential feature that could more or less reliably be used to differentiate between poor and excellent wines. Indeed, when I later built a linear model to predict a wine grade, this feature alone accounted for roughly two thirds of the variance explained by the final model (R-squared of 0.190 out of 0.280).
In the final part of my analysis, I used wine grades to color-code and facet a few plots that I’d built previously to see if any variables reinforce each other across any of the wine grades. The main finding here was that in the two most strongly correlated pairs the correlation was so pronounced that the trend stayed the same across all wine grades: it was always upward for density vs residual sugar and downward for alcohol vs density. The situation was a bit different for more weakly correlated pairs: the trend did stay the same across most wine grades, but for the worst or best wines, the features I was analyzing displayed little to no correlation at all (for example, alcohol vs total SO2), which signaled these combinations were probably not the best predictors of wine quality. I tested these findings when building a linear model and excluded the worst contributors from the final version.
I’ve also bumped into a couple of obstacles along the way. First, I found out that the residual sugar distribution, when log-transformed, is bimodal across all wine grades. I’ve been struggling to explain this phenomenon for some time and even read a few specialized articles on the topic, but found no satisfactory explanation so far. So I’m inclined to believe this phenomenon is specific to Portuguese wines, since that’s what I’ve been analyzing all along.
Another thing I had difficulties with was the linear model that I’d built. It was able to explain only 28% of the variance in wine quality, which I found to be a pretty poor result. At first, I thought I was doing something wrong and actually spent a couple of days trying to engineer new features and combine them in various ways (to no avail), but then I realized that some other factors were at play and physico-chemical properties alone were not enough of a quality predictor.
And this realization leads me to suggestions on how to improve this analysis. First and foremost, more data would be nice. 5,000 wine samples is alright, but given the number of wines in the world, it’s just a drop in the ocean. Besides, the dataset is restricted to only Portuguese wines, which significantly limits its value and ability to represent the whole population. Second, as I mentioned above, there must be some other features that heavily influence wine quality. Better results might have been obtained if we had information about the region where a wine was produced, the year it was produced, grape type, selling price and wine brand, to name a few. Also, it might be a good idea to test other kinds of models and see how they fare against each other. I guess more flexible models, like SVM or tree-based methods, might have produced better results.